After completing this course I will understand the principles and advantages of using open research tools with open data, and the possibilities of reproducible research. I will know how to use R, RStudio, R Markdown, and GitHub for these tasks, and how to learn more about such open software tools. I will also know how to apply certain statistical methods of data science, that is, data-driven statistics.
The data was collected to study the relationship between academic achievement and various background factors, the most interesting of which concern studying behavior.
The data was collected during the second part of the course Introduction to Social Statistics, fall 2014, held by Kimmo Vehkalahti at the University of Helsinki, Finland.
(n=183, 67% female, mean age 25.6 years, 77.6% social science students)
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 7
The data contains 166 observations and seven variables.
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The scatter plot matrix doesn’t reveal much, except that attitude and exam points correlate noticeably more strongly (0.437) than any other pair of variables. We’ll take a closer look at that later.
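The pairwise correlations behind a plot like this can also be pulled out numerically with base R's `cor()`. A minimal sketch, using the built-in `mtcars` data as a stand-in (the course's `learning2014` data isn't bundled with R):

```r
# Pairwise Pearson correlations for a few numeric columns.
# mtcars stands in for learning2014; the idea is identical:
# cor() returns a symmetric matrix, and scanning one column of it
# finds the variable most correlated with the outcome.
vars <- mtcars[, c("mpg", "wt", "hp", "qsec")]
cm <- round(cor(vars), 3)
print(cm)

# Strongest (absolute) correlate of the outcome mpg:
cors_with_mpg <- cm["mpg", colnames(cm) != "mpg"]
print(names(which.max(abs(cors_with_mpg))))  # -> "wt"
```

This is the same number ggpairs prints in its upper panels, just without the plotting machinery.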
summary(learning2014$gender)
## F M
## 110 56
summary(learning2014$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 21.00 22.00 25.51 27.00 55.00
summary(learning2014$attitude)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.400 2.600 3.200 3.143 3.700 5.000
summary(learning2014$deep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 3.333 3.667 3.680 4.083 4.917
summary(learning2014$surf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.583 2.417 2.833 2.787 3.167 4.333
summary(learning2014$stra)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.250 2.625 3.188 3.121 3.625 5.000
summary(learning2014$points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 19.00 23.00 22.72 27.75 33.00
The goal is to model exam points with a few explanatory variables. I picked age, attitude and strategic learning.
model <- lm(points ~ age + attitude + stra, data=learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## age -0.08822 0.05302 -1.664 0.0981 .
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Age does not seem to fit the model very well. I expected strategic learning to have some effect on the points achieved, but it turned out not to be statistically significant. Only attitude seems relevant, so that’s what we’ll use in our next and final model:
model <- lm(points ~ attitude, data=learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Residuals are the differences between the observed values and the values estimated by the model. The median residual is low at about 0.4, but the minimum residual is large: the model predicted the worst observed exam score about 17 points too high. The maximum residual shows the opposite: the best score was predicted about ten points too low.
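The residuals reported by `summary()` are just observed minus fitted values. A quick verification sketch, on `mtcars` as a stand-in since `learning2014` isn't shipped with R:

```r
# Residuals = observed outcome minus the model's fitted values.
fit <- lm(mpg ~ wt, data = mtcars)
manual_resid <- mtcars$mpg - fitted(fit)

# Matches residuals(fit) up to floating point:
print(all.equal(unname(manual_resid), unname(residuals(fit))))  # TRUE

# The five-number summary printed by summary(fit) comes from these:
print(round(quantile(residuals(fit)), 4))
```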
The t value is the test statistic of a t-test on the coefficient (the estimate divided by its standard error). A larger absolute t value points to a more significant coefficient, but it can’t be interpreted alone, because its distribution depends on the degrees of freedom. The Pr(>|t|) column translates it into a p-value; the three stars (***) tell us the attitude coefficient is statistically significant.
The multiple R-squared is 0.191 (19.1%). It measures how much of the variation in exam points the fitted regression line explains. Multiple R-squared never decreases when variables are added, so dropping the two insignificant variables lowered it slightly (from 0.218 in the previous model).
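All of these summary statistics can be reproduced by hand, which makes their meaning concrete. A sketch on the built-in `mtcars` data (a stand-in for `learning2014`):

```r
fit <- lm(mpg ~ wt, data = mtcars)
s   <- summary(fit)

# t value = estimate / standard error
est <- s$coefficients["wt", "Estimate"]
se  <- s$coefficients["wt", "Std. Error"]
t_manual <- est / se

# two-sided p-value from the t distribution with the residual df
p_manual <- 2 * pt(abs(t_manual), df = fit$df.residual, lower.tail = FALSE)

# Multiple R-squared = 1 - RSS/TSS: share of outcome variance explained
rss <- sum(residuals(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2_manual <- 1 - rss / tss

print(round(c(t = t_manual, p = p_manual, R2 = r2_manual), 4))
```

Comparing these against `summary(fit)$coefficients` and `summary(fit)$r.squared` shows they agree exactly.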
plot(model, which=c(1))
The residuals vs. fitted plot is used to detect non-linearity. If there were a visible pattern in the line, the model could be misspecified. Our line is fairly straight, so the model seems fine.
plot(model, which=c(2))
The Q-Q plot checks whether the residuals of the model are normally distributed. Ideally all the points would lie on the dotted line. In our model the points curve away slightly at the tails, but stay fairly close to the ideal line.
plot(model, which=c(5))
This plot shows whether there are influential cases that could pull the regression line. Our model seems to have none, since the Cook’s distance contours don’t even appear in the plot.
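Influence can also be inspected numerically instead of reading it off the plot. A sketch on `mtcars` (a stand-in dataset), using one common rule of thumb that flags points with Cook’s distance above 4/n; thresholds of 0.5 or 1 are also used:

```r
fit <- lm(mpg ~ wt, data = mtcars)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (one of several in use): 4 / number of observations
cutoff <- 4 / nrow(mtcars)
influential <- names(cd[cd > cutoff])
print(influential)
```

Any observation listed here deserves a second look before trusting the fitted line.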
The data describes the alcohol use of secondary school students, collected for a study at the University of Camerino, Italy. It has been joined from two course datasets, one for the Portuguese language course and one for the math course.
There is a lot of background information about the participants, for example their father’s and mother’s jobs, travel times to school and such.
Obligatory citation for the original data: Fabio Pagnotta and Hossain Mohammad Amran, Using Data Mining To Predict Secondary School Student Alcohol Consumption, Department of Computer Science, University of Camerino.
colnames(alcohol_use)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
dim(alcohol_use)
## [1] 382 35
The data contains 382 observations and 35 variables.
I’m expecting lower study time to mean a higher alcohol consumption.
The codes you will see later have the following meaning:
1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours
I’m expecting people living in cities to use more alcohol. Living areas are coded as follows: ‘U’ - urban or ‘R’ - rural.
I’m expecting men to have a higher alcohol use.
I’m expecting going out more to increase alcohol use. The answers range from 1 (very low) to 5 (very high).
ggplot(alcohol_use, aes(x = studytime, fill = high_use)) + geom_bar(position = "fill") + ggtitle("Study time vs alcohol use")
Spending less time on studies seems to be associated with higher alcohol consumption.
ggplot(alcohol_use, aes(x = address, fill = high_use)) + geom_bar(position = "fill") + ggtitle("Living area vs alcohol use")
I was expecting students living in urban areas (U) to drink more, but it seems that rural (R) students drink more. The difference isn’t large, but it’s worth a look.
ggplot(alcohol_use, aes(x = sex, fill = high_use)) + geom_bar(position = "fill") + ggtitle("Sex vs alcohol use")
It seems like men are more likely to consume more alcohol than their female counterparts.
ggplot(alcohol_use, aes(x = goout, fill = high_use)) + geom_bar(position = "fill") + ggtitle("Going out vs alcohol use")
High alcohol use seems to rise quite significantly with the frequency of going out.
m <- glm(high_use ~ studytime + address + sex + goout, data = alcohol_use, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ studytime + address + sex + goout, family = "binomial",
## data = alcohol_use)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7687 -0.7949 -0.4701 0.7707 2.4382
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1747 0.6014 -3.616 0.0003 ***
## studytime -0.5060 0.1699 -2.978 0.0029 **
## addressU -0.7734 0.3021 -2.560 0.0105 *
## sexM 0.6512 0.2638 2.469 0.0136 *
## goout 0.7730 0.1196 6.462 1.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 383.57 on 377 degrees of freedom
## AIC: 393.57
##
## Number of Fisher Scoring iterations: 4
Living area and sex have some effect, but not as strong as the frequency of going out or the time spent studying. Going out seems to have the largest effect.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1136447 0.03397617 0.3612755
## studytime 0.6028828 0.42771567 0.8343182
## addressU 0.4614355 0.25463666 0.8355741
## sexM 1.9179216 1.14615394 3.2310521
## goout 2.1661707 1.72470400 2.7596110
The odds ratios indicate that being male and going out frequently give the highest odds of high alcohol consumption, although their confidence intervals are wide. Judging by the ratios, more time spent studying and living in an urban area make high alcohol use less likely, while being male and going out more make it more likely.
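The link between a logistic coefficient and an odds ratio can be verified on a toy example: with a single binary predictor, `exp(coef)` equals the classic cross-product odds ratio from the 2×2 contingency table. A minimal sketch with made-up counts (the data here is invented purely for illustration):

```r
# Toy data (invented): binary outcome y and a binary predictor grp.
d <- data.frame(
  y   = rep(c(1, 0, 1, 0), times = c(30, 20, 10, 40)),
  grp = rep(c("A", "A", "B", "B"), times = c(30, 20, 10, 40))
)
m <- glm(y ~ grp, data = d, family = "binomial")

# Odds ratio from the model...
or_model <- exp(coef(m)["grpB"])

# ...and the same ratio straight from the 2x2 table:
# odds(B) / odds(A) = (10/40) / (30/20) = 1/6
tab <- table(d$grp, d$y)
or_table <- (tab["B", "1"] * tab["A", "0"]) / (tab["B", "0"] * tab["A", "1"])

print(c(model = unname(or_model), table = or_table))
```

Both numbers come out to about 0.167, which is why exponentiating the coefficients is the standard way to read a logistic model.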
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alcohol_use'
alcohol_use <- mutate(alcohol_use, probability = probabilities)
# use the probabilities to make a prediction of high_use
alcohol_use <- mutate(alcohol_use, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alcohol_use$high_use, prediction = alcohol_use$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66753927 0.03926702 0.70680628
## TRUE 0.18324607 0.10994764 0.29319372
## Sum 0.85078534 0.14921466 1.00000000
The model gives false positives at a rate of about 4% and false negatives at about 18%. Interestingly, actual high users are misclassified more often than they are caught: 18% false negatives against only 11% true positives.
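The same table can be boiled down to the standard classification metrics. A sketch using counts reconstructed from the proportions above (proportion × 382 observations, so treat the numbers as derived rather than raw):

```r
# Confusion-matrix counts implied by the proportions above:
# rows = actual high_use, cols = predicted.
cm <- matrix(c(255, 15,
                70, 42),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cm)) / sum(cm)                 # share classified correctly
sensitivity <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])  # recall on high users
specificity <- cm["FALSE", "FALSE"] / sum(cm["FALSE", ])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 3))
```

Note that 1 − accuracy ≈ 0.2225, which is exactly the training error the loss function computes below.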
# proportion of predictions landing on the wrong side of the 0.5 threshold
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alcohol_use$high_use, prob = alcohol_use$probability)
## [1] 0.2225131
cv <- cv.glm(data = alcohol_use, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2329843
The training error is about 22.3%, and 10-fold cross-validation gives roughly 23%, depending on the run. My model therefore seems to perform slightly better than the model in DataCamp (about 26% error).
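`cv.glm` hides the mechanics, so here is a hand-rolled K-fold loop that makes them explicit. A sketch on `mtcars` (a stand-in target: whether `mpg` is above its median), using the same 0.5-threshold misclassification loss:

```r
set.seed(42)
d <- mtcars
d$high_mpg <- as.numeric(d$mpg > median(d$mpg))

K <- 5
folds <- sample(rep(1:K, length.out = nrow(d)))  # random fold assignment
errs <- numeric(K)

for (k in 1:K) {
  train <- d[folds != k, ]
  test  <- d[folds == k, ]
  fit <- glm(high_mpg ~ wt + hp, data = train, family = "binomial")
  p   <- predict(fit, newdata = test, type = "response")
  errs[k] <- mean((p > 0.5) != test$high_mpg)  # out-of-fold error
}

print(mean(errs))  # K-fold cross-validated error estimate
```

Each observation is held out exactly once, and the reported number averages the K out-of-fold error rates, which is what `cv.glm`'s first `delta` value corresponds to.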
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alcohol_use, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
We are going to use the Boston dataset included in the MASS package. The data deals with housing values in the suburbs of Boston. It includes some background data, for example the average number of rooms per dwelling and distances to Boston employment centres. Crime rate will be the main variable in this analysis, since it’s the target variable in most of the exercises.
The variables are covered in detail here for the interested.
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The scales vary considerably between the variables. For example, tax has a maximum value of 711.0, while nox tops out at 0.871. That would be a problem when we try to categorize and cluster the data, so we’re going to deal with it soon.
correlation_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(correlation_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The bigger the circle and the stronger its color, the stronger the correlation between the variables. For example, there is a strong positive correlation between the property-tax rate (tax) and accessibility to radial highways (rad).
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
After standardizing, every variable has mean 0 and standard deviation 1. This puts all the variables on the same scale, which makes them easier to cluster.
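`scale()` is nothing more than columnwise (x − mean(x)) / sd(x). A quick verification sketch on a built-in column (`mtcars` stands in, since Boston lives in the MASS package):

```r
# Manual standardization of one column equals what scale() produces.
x <- mtcars$hp
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.numeric(scale(x))

print(all.equal(z_manual, z_scale))  # TRUE

# The standardized column has mean 0 and standard deviation 1:
print(c(mean = round(mean(z_manual), 10), sd = sd(z_manual)))
```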
# Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.
# save crim as scaled_crim (it is already standardized, so re-scaling would be a no-op)
scaled_crim <- boston_scaled$crim
# create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
# create a categorical variable 'crime'
labels <- c("low", "med_low", "med_high", "high")
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = labels)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide the dataset to train and test sets, so that 80% of the data belongs to the train set.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col = color, pos = 3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
The plot indicates a strong association between a high crime rate and accessibility to radial highways (rad). There’s some clear clustering around the medium-high category at the bottom left of the plot as well. Medium lows and lows are a bit mixed, but they show some clear clustering too, although not as much as the other classes.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 8 2 0
## med_low 3 17 8 0
## med_high 0 10 16 2
## high 0 0 0 25
The model predicts the high crime category perfectly: every actual “high” observation is classified as high. The low and medium categories mostly land in the right places as well, but with more confusion between neighboring classes.
The same effect can be seen from the previous plot. All the high rates are at the far right of the plot while the other values are clustered together at the left side.
boston_scaled <- as.data.frame(scale(Boston))
distances <- dist(boston_scaled)
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(distances, k)$tot.withinss})
# visualize the results
plot(1:k_max, twcss, type='b')
The “elbow” is at 2 clusters, so that’s the number of centers we’ll use.
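The elbow search itself is just the total within-cluster sum of squares (TWCSS) computed over a range of k. A self-contained sketch on the scaled iris measurements; note that here k-means runs on the data matrix directly, the more common usage, rather than on a distance matrix as above:

```r
set.seed(123)  # k-means starts are random, so fix the seed for reproducibility
d <- scale(iris[, 1:4])
k_max <- 10

# TWCSS for each candidate number of clusters; nstart = 10 retries the
# random initialization to avoid poor local optima.
twcss <- sapply(1:k_max, function(k) {
  kmeans(d, centers = k, nstart = 10)$tot.withinss
})
print(round(twcss, 1))

# TWCSS always drops as k grows; the "elbow" is where the drop levels off.
plot(1:k_max, twcss, type = "b")
```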
# k-means clustering
km <- kmeans(distances, centers = 2)
boston_scaled$km <- as.factor(km$cluster)
# plot the Boston dataset with clusters
ggpairs(boston_scaled, ggplot2::aes(colour = km), upper = list(continuous = wrap("cor", size = 4.75, alignPercent = 1)))
K-means produces two clusters that do seem to carry some meaning judging by the charts, but they overlap quite a lot, so it’s hard to say much about them.
The dataset joins two datasets from the United Nations Development Programme and contains country-level statistics. We’ve created two additional variables: the female-to-male education ratio and the female-to-male labor force participation ratio.
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ femaleToMaleEducationRatio: num 1.007 0.997 0.983 0.989 0.969 ...
## $ labourParticipationRatio : num 0.891 0.819 0.825 0.884 0.829 ...
## $ expectedYearsOfEducation : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ lifeExpectancy : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ maternalMortalityRatio : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adolescentBirthRate : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ representationInParliament: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## femaleToMaleEducationRatio labourParticipationRatio
## Min. :0.1717 Min. :0.1857
## 1st Qu.:0.7264 1st Qu.:0.5984
## Median :0.9375 Median :0.7535
## Mean :0.8529 Mean :0.7074
## 3rd Qu.:0.9968 3rd Qu.:0.8535
## Max. :1.4967 Max. :1.0380
## expectedYearsOfEducation lifeExpectancy gni
## Min. : 5.40 Min. :49.00 Min. : 581
## 1st Qu.:11.25 1st Qu.:66.30 1st Qu.: 4198
## Median :13.50 Median :74.20 Median : 12040
## Mean :13.18 Mean :71.65 Mean : 17628
## 3rd Qu.:15.20 3rd Qu.:77.25 3rd Qu.: 24512
## Max. :20.20 Max. :83.50 Max. :123124
## maternalMortalityRatio adolescentBirthRate representationInParliament
## Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 49.0 Median : 33.60 Median :19.30
## Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :1100.0 Max. :204.80 Max. :57.50
The extremes of the variables we’ve introduced are interesting. The maximum female-to-male education ratio is about 1.5, which belongs to Gabon. Somewhat surprisingly, the minimum labor participation ratio is as high as 0.186 (18.6%), considering that some countries in the data have very limited women’s rights.
correlation_matrix <- cor(human) %>% round(digits = 2)
corrplot(correlation_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
The correlation matrix indicates a strong negative correlation between education and adolescent birth rate, which could explain the current situation in Western countries. Maternal mortality also appears to fall as education rises.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, cex = c(2, 2), col = c("#333333", "#dd4814"))
The plot looks practically one-dimensional, so there’s not much to say: GNI is the only visible arrow, because PCA on unstandardized data is dominated by the variable with the largest variance, and GNI’s scale dwarfs everything else. Standardizing should help.
human_scaled <- scale(human)
# perform principal component analysis (with the SVD method)
pca_human_scaled <- prcomp(human_scaled)
# s <- summary(pca_human_scaled)
# rounded percentages of variance captured by each PC
# pca_pr <- round(100*s$importance[2, ], digits = 1)
# create object pc_lab to be used as axis labels
# pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human_scaled, cex = c(2, 2), col = c("#333333", "#dd4814"), xlab = "Well being", ylab = "Equality")
After standardizing, the plot begins to make sense. Education and life expectancy appear closely related, as do adolescent birth rate and maternal mortality. Representation in parliament and the labor participation ratio also seem related, though not as strongly as the previous pairs. We can also create a nicer plot of these findings:
Pretty much all the countries on the right side of the plot experience high birth rates and maternal mortality, and most of them appear to be African countries.
PC1 (the x-axis) seems to represent the well-being of a country, since it combines life expectancy, mortality and education. PC2 (the y-axis) is harder to name; it could be labeled equality, since it consists of labor participation and representation in parliament.
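The share of variance each PC captures, which the commented-out `pc_lab` code above pastes into the axis labels, comes straight from prcomp's standard deviations. A sketch on the scaled iris measurements as stand-in data:

```r
# Proportion of variance per principal component: squared sdev over their sum.
pca <- prcomp(scale(iris[, 1:4]))
var_share <- pca$sdev^2 / sum(pca$sdev^2)
print(round(100 * var_share, 1))

# These are the percentages a label like "PC1 (xx%)" reports; they always
# sum to 100 and decrease from PC1 onward.
```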
We’re going to look at the tea dataset provided by FactoMineR. It describes how people drink their tea.
I selected some interesting variables for us to look at:
data(tea)
tea_columns_to_keep <- c("Tea", "How", "how", "sugar", "breakfast", "sex", "SPC", "relaxing")
custom_tea <- dplyr::select(tea, dplyr::one_of(tea_columns_to_keep))
gather(custom_tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar(fill = "#dd4814") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
It seems like breakfast tea is very popular, and there’s pretty much a 50% chance of sugar use.
# multiple correspondence analysis
mca <- MCA(custom_tea, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = custom_tea, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.204 0.178 0.171 0.157 0.150 0.143
## % of var. 9.586 8.385 8.045 7.384 7.073 6.719
## Cumulative % of var. 9.586 17.971 26.016 33.400 40.473 47.192
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.136 0.127 0.122 0.115 0.114 0.106
## % of var. 6.408 5.986 5.753 5.431 5.384 4.977
## Cumulative % of var. 53.600 59.586 65.339 70.770 76.154 81.131
## Dim.13 Dim.14 Dim.15 Dim.16 Dim.17
## Variance 0.099 0.087 0.078 0.075 0.062
## % of var. 4.654 4.073 3.687 3.548 2.908
## Cumulative % of var. 85.785 89.858 93.545 97.092 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.590 0.569 0.173 | 0.317 0.188 0.050 | 0.088
## 2 | 0.806 1.063 0.282 | -0.123 0.028 0.007 | -0.337
## 3 | -0.314 0.161 0.041 | -0.326 0.199 0.045 | 0.203
## 4 | -0.685 0.769 0.408 | 0.313 0.184 0.085 | 0.174
## 5 | -0.114 0.021 0.010 | 0.104 0.020 0.009 | -0.217
## 6 | -0.471 0.363 0.196 | -0.039 0.003 0.001 | 0.093
## 7 | 0.251 0.103 0.037 | 0.456 0.389 0.123 | -0.309
## 8 | 0.463 0.350 0.099 | -0.235 0.103 0.026 | -0.158
## 9 | 0.479 0.375 0.101 | 0.701 0.920 0.217 | -0.726
## 10 | 0.687 0.772 0.216 | 0.238 0.106 0.026 | -0.602
## ctr cos2
## 1 0.015 0.004 |
## 2 0.222 0.049 |
## 3 0.080 0.017 |
## 4 0.059 0.026 |
## 5 0.092 0.038 |
## 6 0.017 0.008 |
## 7 0.186 0.057 |
## 8 0.049 0.012 |
## 9 1.028 0.232 |
## 10 0.706 0.166 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.925 12.947 0.280 9.151 | -0.320 1.774
## Earl Grey | -0.441 7.679 0.351 -10.242 | 0.170 1.305
## green | 0.506 1.725 0.032 3.073 | -0.276 0.590
## alone | -0.175 1.218 0.057 -4.118 | -0.296 4.005
## lemon | -0.130 0.113 0.002 -0.788 | 0.427 1.410
## milk | 0.439 2.488 0.051 3.917 | 0.779 8.934
## other | 1.186 2.588 0.043 3.606 | -0.597 0.751
## tea bag | -0.210 1.538 0.058 -4.159 | 0.126 0.630
## tea bag+unpackaged | -0.001 0.000 0.000 -0.012 | -0.121 0.324
## unpackaged | 0.996 7.303 0.135 6.359 | -0.277 0.648
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.034 -3.168 | -0.408 2.996 0.054 -4.033 |
## Earl Grey 0.052 3.949 | -0.065 0.198 0.008 -1.505 |
## green 0.009 -1.681 | 1.293 13.447 0.207 7.860 |
## alone 0.163 -6.984 | 0.204 1.987 0.078 4.818 |
## lemon 0.023 2.599 | 0.345 0.958 0.015 2.099 |
## milk 0.161 6.942 | -0.550 4.642 0.080 -4.902 |
## other 0.011 -1.816 | -1.847 7.484 0.106 -5.617 |
## tea bag 0.021 2.488 | 0.071 0.209 0.007 1.404 |
## tea bag+unpackaged 0.007 -1.417 | -0.554 7.037 0.140 -6.473 |
## unpackaged 0.010 -1.772 | 1.112 10.844 0.169 7.099 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.364 0.052 0.228 |
## How | 0.104 0.215 0.206 |
## how | 0.144 0.023 0.247 |
## sugar | 0.149 0.353 0.018 |
## breakfast | 0.054 0.044 0.339 |
## sex | 0.089 0.454 0.036 |
## SPC | 0.584 0.284 0.217 |
## relaxing | 0.140 0.000 0.077 |
# visualize MCA
plot(mca, invisible = c("ind"), habillage = "quali")
The percentages of explained variance are quite low, and that shows in the plot as well: most of the categories cluster around the center.
Milk use and being a man seem to be related, as do being an employee and skipping breakfast, which makes sense. Employees also seem to drink their tea alone, and non-working people tend not to put sugar in their tea. Black tea seems to be drunk mostly unpackaged.
Students, workmen and seniors stand out in the plot; they do not seem to be related to any of the other groups.